{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c024bfa4-1a7a-4751-b5a1-827225a3478b",
   "metadata": {
    "id": "c024bfa4-1a7a-4751-b5a1-827225a3478b"
   },
   "source": [
    "**LLM Workshop 2024 by Sebastian Raschka**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58b8c870-fb72-490e-8916-d8129bd5d1ff",
   "metadata": {
    "id": "58b8c870-fb72-490e-8916-d8129bd5d1ff"
   },
   "source": [
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "# 6) Instruction finetuning (part 3; benchmark evaluation)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "013b3a3f-f300-4994-b704-624bdcd6371b",
   "metadata": {},
   "source": [
    "- In the previous notebook, we finetuned the LLM; in this notebook, we evaluate it using popular benchmark methods\n",
    "\n",
     "- There are three main types of model evaluation:\n",
     "\n",
     "  1. MMLU-style multiple-choice Q&A\n",
     "  2. LLM-based automatic scoring (LLM-as-a-judge)\n",
     "  3. Human ratings by relative preference"
   ]
  },
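  {
   "cell_type": "markdown",
   "id": "mc-scoring-sketch-md",
   "metadata": {},
   "source": [
    "- As a rough illustration of type 1, a minimal sketch of multiple-choice scoring (the function names below are made up; a real harness such as the one used later scores each option via the model's log-likelihoods):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "mc-scoring-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimal sketch of MMLU-style multiple-choice evaluation:\n",
    "# the model assigns a score (e.g., a log-likelihood) to each\n",
    "# answer option, and the highest-scoring option is taken as\n",
    "# the model's answer\n",
    "\n",
    "def pick_answer(option_scores):\n",
    "    # Index of the highest-scoring answer option\n",
    "    return max(range(len(option_scores)), key=option_scores.__getitem__)\n",
    "\n",
    "def accuracy(examples, score_fn):\n",
    "    # examples: (question, options, correct_index) tuples;\n",
    "    # score_fn(question, options) stands in for the model's scoring\n",
    "    correct = sum(\n",
    "        pick_answer(score_fn(q, opts)) == ans\n",
    "        for q, opts, ans in examples\n",
    "    )\n",
    "    return correct / len(examples)"
   ]
  },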
  {
   "cell_type": "markdown",
   "id": "dfc0f1d9-f72b-4cf7-aef6-0407acf1a46e",
   "metadata": {},
   "source": [
    "<img src=\"figures/10.png\" width=800px>\n",
    "\n",
    "<img src=\"figures/11.png\" width=800px>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c389e66f-5463-4e40-820c-55d1e3eb5077",
   "metadata": {},
   "source": [
    "\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "\n",
    "<img src=\"figures/13.png\" width=800px>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0070e756",
   "metadata": {},
   "source": [
    "<img src=\"figures/14.png\" width=800px>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45464478-d7b1-4300-b251-a6e125c8ae52",
   "metadata": {},
   "source": [
    "## https://tatsu-lab.github.io/alpaca_eval/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de7ba804",
   "metadata": {},
   "source": [
    "<img src=\"figures/15.png\" width=800px>\n",
    "\n",
    "## https://chat.lmsys.org"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e084ce8a-bcba-4c67-ad49-0a645ce65eb7",
   "metadata": {
    "tags": []
   },
   "source": [
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "# 6.2 Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d0f6422-582f-4996-ae2a-dbacc9ebf86d",
   "metadata": {},
   "source": [
     "- In this notebook, we run an MMLU-style evaluation in LitGPT, which is based on the [EleutherAI LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)\n",
     "- There are hundreds, if not thousands, of benchmark tasks; in the command below, we restrict the evaluation to a single MMLU subset, because running it on the whole MMLU dataset would take a very long time"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67bb4cb5-e8ea-445b-8862-e7151c1544fc",
   "metadata": {},
   "source": [
     "- Let's say we are interested in the `mmlu_philosophy` subset; we can evaluate the LLM on it as follows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4edc913",
   "metadata": {},
   "source": [
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "# Exercise 3: Evaluate the finetuned LLM"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "773a25be-5a02-477b-bfea-ffd53e44647b",
   "metadata": {},
   "source": [
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "# Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "6cd718c4-0e83-4a83-84f8-59e3fc4c3404",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'batch_size': 4,\n",
      " 'checkpoint_dir': PosixPath('out/finetune/lora/final'),\n",
      " 'device': None,\n",
      " 'dtype': None,\n",
      " 'force_conversion': False,\n",
      " 'limit': None,\n",
      " 'num_fewshot': None,\n",
      " 'out_dir': None,\n",
      " 'save_filepath': None,\n",
      " 'seed': 1234,\n",
      " 'tasks': 'mmlu_philosophy'}\n",
      "2024-07-04:00:57:13,332 INFO     [huggingface.py:170] Using device 'cuda'\n",
      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n",
      "2024-07-04:00:57:18,981 INFO     [evaluator.py:152] Setting random seed to 1234 | Setting numpy seed to 1234 | Setting torch manual seed to 1234\n",
      "2024-07-04:00:57:18,981 INFO     [evaluator.py:203] Using pre-initialized model\n",
      "2024-07-04:00:57:24,808 INFO     [evaluator.py:261] Setting fewshot random generator seed to 1234\n",
      "2024-07-04:00:57:24,809 INFO     [task.py:411] Building contexts for mmlu_philosophy on rank 0...\n",
      "100%|████████████████████████████████████████| 311/311 [00:00<00:00, 807.98it/s]\n",
      "2024-07-04:00:57:25,206 INFO     [evaluator.py:438] Running loglikelihood requests\n",
      "Running loglikelihood requests:   0%|                  | 0/1244 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)\n",
      "Running loglikelihood requests: 100%|██████| 1244/1244 [00:07<00:00, 158.49it/s]\n",
      "2024-07-04:00:57:33,515 WARNING  [huggingface.py:1315] Failed to get model SHA for /teamspace/studios/this_studio/out/finetune/lora/final/evaluate at revision main. Error: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/teamspace/studios/this_studio/out/finetune/lora/final/evaluate'. Use `repo_type` argument if needed.\n",
      "fatal: not a git repository (or any parent up to mount point /teamspace/studios)\n",
      "Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).\n",
      "|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|\n",
      "|----------|------:|------|-----:|------|---|-----:|---|-----:|\n",
      "|philosophy|      0|none  |     0|acc   |↑  |0.5691|±  |0.0281|\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!litgpt evaluate out/finetune/lora/final --tasks \"mmlu_philosophy\" --batch_size 4"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "V100",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
